Abstract: This paper introduces a new method for learning
and inferring sparse representations of depth (disparity) maps. 
The proposed algorithm relaxes the usual assumption of the 
stationary noise model in sparse coding. This enables learning 
from data corrupted with spatially varying noise or uncertainty, 
typically obtained by laser range scanners or structured light 
depth cameras. Sparse representations are learned from the 
Middlebury database disparity maps and then exploited in a 
two-layer graphical model for inferring depth from stereo, by 
including a sparsity prior on the learned features. Since they 
capture higher-order dependencies in the depth structure, these 
priors can complement smoothness priors commonly used in 
depth inference based on Markov Random Field (MRF) models. 
Inference on the proposed graph is achieved using an alternating 
iterative optimization technique, where the first layer is solved 
using an existing MRF-based stereo matching algorithm, then 
held fixed as the second layer is solved using the proposed 
non-stationary sparse coding algorithm. This leads to a general 
method for improving solutions of state-of-the-art MRF-based
depth estimation algorithms. Our experimental results first show
that depth inference using learned representations leads to
state-of-the-art denoising of depth maps obtained from laser range
scanners and a time-of-flight camera. Furthermore, we show
that adding sparse priors improves the results of two depth
estimation methods: the classical graph cut algorithm [1] and
the more recent algorithm of Woodford et al. [2].

Index Terms — Sparse approximations, dictionary learning, 
depth denoising, depth from stereo. 



I. Introduction 

Finding efficient representations of depth or disparity maps 
is important for applications involving inverse problems such 
as depth denoising and inpainting (for example in view synthesis
[3]), and depth map compression (e.g., in 3DTV [4]).
Inverse problems have been extensively studied for natural images,
for example using wavelet representations [5]. However,
because of differences between image and depth statistics, 
it is not obvious that wavelets are the most efficient way 
to represent the structure of depth maps. Thus, we prefer 
to learn an efficient representation from a large database 
of examples. Sparse coding [6] uses this approach to find 
overcomplete dictionaries of waveforms (atoms) in which the 
data has a sparse decomposition. Sparse coding and other 
dictionary learning techniques have been successfully applied 
to learning image [6]-[11] and audio [12] representations.
Imposing sparse, non-Gaussian priors over latent variables 
in a linear generative model leads to a learning rule which 
produces dictionaries with elements that capture non-trivial
aspects of the data statistics, such as long-range spatial corre- 
lations. These learned dictionaries, which capture higher-order 
dependencies in the data, can be used to regularize methods for 
solving inverse problems, yielding state-of-the-art performance
in denoising [13].

Most algorithms for dictionary learning assume that the 
signal is corrupted by stationary additive white Gaussian noise. 
While this is often a valid assumption in natural images, it does 
not hold for depth data. Even when measuring depth directly 
with range scanners, noise varies locally due to the different re- 
flection of scanner light pulses around transparent or reflective 
surfaces, or near boundaries. Likewise, estimation of disparity 
from stereo images using standard computer vision algorithms 
yields disparity maps with variable uncertainty at each pixel in 
the map. Therefore, learning representations of depth requires 
adaptation of learning algorithms in order to deal with non- 
stationary noise in depth maps or with the unreliability of 
disparity map estimates. One contribution of this paper is a 
new learning algorithm based on sparse coding that is able 
to cope with non-stationary depth estimation errors. Noise
statistics are inferred along with sparse coefficients during the 
inference step, which are then passed to the learning step that 
properly incorporates this uncertainty into the adaptation of the 
dictionary. This allows the dictionary learning method to be 
spatially adaptive and robust to noise. We show that this sparse 
coding method gives state-of-the-art performance in denoising 
of depth maps. 

Learned representations of disparity are also important 
priors in depth estimation from stereo images, which is still 
a challenging problem in computer vision and robotics. Our 
second contribution is a new stereo matching algorithm that 
exploits the sparse prior over the learned depth atoms, which 
allows for modeling higher-order dependencies in depth map 
data. Such higher order priors encompass more information 
about the 3D structure than smoothness priors typically used 
in computer vision. We define stereo matching as a maximum
a posteriori depth estimation problem on a two-layer graphical
model, where the top layer consists of hidden units that repre- 
sent the coefficients in the sparse code of the depth map. The 
middle layer incorporates the output of the upper layer into a 
Markov Random Field (MRF) with neighborhood smoothness 
constraints. The probabilities for each depth estimate given by 
the sparse priors are used to refine the input to the MRF that 
can be defined and solved using any of the existing MRF- 
based stereo matching algorithms. Therefore, the proposed 
approach represents a generic way to include higher order
priors to existing MRF-based algorithms in order to improve 
their solutions. Our final experiments on the Tsukuba, Cones and
Teddy datasets [14] demonstrate that sparse priors can be used
to regularize depth map estimates and quantitatively improve
stereo matching results of the standard graph-cut algorithm
(GC) [1] and the more recent second order prior algorithm
(2OP) [2].

The paper is structured as follows. In Section II we
formulate the new sparse coding method using a generative
model with non-stationary noise, and we present its energy
minimization solution in Section III. Section IV describes
depth inference from stereo based on sparse priors over
learned disparity dictionaries. Experimental results in depth
learning, denoising and inference from stereo are presented in
Section V.

II. Sparse coding with non-stationary additive 
Gaussian noise 

The main principle underlying sparse coding (also called
dictionary learning) theory is that some signals of dimension
$N$ are well represented by a linear combination of a small
number of elements selected from an overcomplete dictionary
$\mathcal{D}$ of size $K > N$. This principle can be captured by the
formalism of a linear generative model $\mathbf{f} = \Phi\mathbf{a} + \epsilon$. Columns
of the matrix $\Phi$ represent atoms from $\mathcal{D}$, $\mathbf{a}$ is the vector of
coefficients that weights each atom, and $\epsilon$ is an $N$-dimensional
vector of i.i.d. Gaussian noise of mean zero and variance $\sigma_0^2$.
Since the dictionary is not given a priori, the main challenge of
sparse coding is to learn the atoms in the dictionary given a set
of training signals. Most approaches for dictionary learning are
based on maximum likelihood estimation, where we look for
the $\Phi$ that maximizes $P(\mathbf{f}|\Phi)$ [6], or the maximum a posteriori
solution that maximizes $P(\Phi|\mathbf{f})$ [15].

Previous approaches for sparse coding assume that the
additive noise $\epsilon$ represents the portion of the signal that cannot
be accounted for by the model, and it is usually modeled by an
i.i.d. Gaussian noise process. However, depth maps acquired
using laser range scanners and structured light contain noise
that has spatially varying statistics. To account for this type
of noise, we propose the following generative model for the
depth map $\mathbf{f}$:

$$\mathbf{f} = \Phi\mathbf{a} + \epsilon + \eta, \tag{1}$$

where $\eta$ captures spatially varying noise from the sensor. We
assume that this noise has a multivariate Gaussian distribution
of zero mean and covariance matrix $\Sigma_\eta$, i.e., $\eta \sim \mathcal{N}(0, \Sigma_\eta)$.
Since we have a sum of two Gaussian noises, the total noise
$\zeta = \epsilon + \eta$ also has a Gaussian distribution $\mathcal{N}(0, \Sigma)$, where
$\Sigma = \Sigma_\eta + \sigma_0^2 I$. This covariance matrix $\Sigma$ represents a set of
variables that we need to infer along with the coefficients $\mathbf{a}$.
In the case of sparse coefficient vectors $\mathbf{a}$, our optimization
problem is:

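$$\Phi^* = \arg\max_{\Phi} P(\mathbf{f}|\Phi) \approx \arg\max_{\Phi} \max_{\mathbf{a},\Sigma} P(\mathbf{f}|\Phi, \mathbf{a}, \Sigma)P(\mathbf{a})P(\Sigma). \tag{2}$$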



In the rest of the paper, we consider only independent (and thus
uncorrelated) external noise, such that $\Sigma_\eta = \mathrm{diag}(\sigma_{\eta,1}^2, \dots, \sigma_{\eta,N}^2)$,
and therefore we have $\Sigma = \mathrm{diag}(\sigma_1^2, \dots, \sigma_N^2)$, where
$\sigma_i^2 = \sigma_0^2 + \sigma_{\eta,i}^2$ for $i = 1, \dots, N$. The case of correlated noise is also
interesting, but outside the scope of this paper.
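
As a rough illustration of the generative model in Eq. (1) with independent external noise, the following sketch draws one synthetic patch. The function name `sample_depth_patch`, the `density` parameter and the way sparse coefficients are sampled (a random active set with Laplace amplitudes) are assumptions made for this example only and are not part of the method described here.

```python
import numpy as np

def sample_depth_patch(Phi, lam=1.0, sigma0=0.01, sigma_eta=None,
                       density=0.05, rng=None):
    """Draw f = Phi a + eps + eta with a diagonal total covariance.

    Phi       : (N, K) dictionary, atoms in columns
    sigma0    : std of the stationary approximation noise eps
    sigma_eta : (N,) std of the external noise eta at each sample
    """
    rng = np.random.default_rng() if rng is None else rng
    N, K = Phi.shape
    # Illustrative sparse code: few active atoms, Laplace amplitudes (scale 1/lam)
    a = np.zeros(K)
    active = rng.random(K) < density
    a[active] = rng.laplace(scale=1.0 / lam, size=int(active.sum()))
    if sigma_eta is None:
        sigma_eta = np.zeros(N)
    # Total variance per sample: sigma_i^2 = sigma_0^2 + sigma_{eta,i}^2
    total_var = sigma0 ** 2 + np.asarray(sigma_eta, dtype=float) ** 2
    noise = rng.normal(size=N) * np.sqrt(total_var)
    return Phi @ a + noise, a, total_var
```

Because the sum of two independent zero-mean Gaussians is Gaussian with summed variances, a single noise draw with variance `total_var` is enough.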

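The conditional probability of $\mathbf{f}$, given $\mathbf{a}$, $\Phi$ and $\Sigma$ in this
case is:

$$P(\mathbf{f}|\Phi, \mathbf{a}, \Sigma) \propto \exp\left[-\frac{1}{2}(\mathbf{f} - \Phi\mathbf{a})^T \Sigma^{-1} (\mathbf{f} - \Phi\mathbf{a})\right] = \prod_{i=1}^{N} \exp\left[-\frac{(f_i - \hat{f}_i)^2}{2\sigma_i^2}\right], \tag{3}$$

where $f_i$ and $\hat{f}_i$ are the $i$-th entries of vectors $\mathbf{f}$ and $\hat{\mathbf{f}} = \Phi\mathbf{a}$,
respectively. For the prior on the coefficient vector $\mathbf{a}$ we take
a Laplace distribution, which is peaked at zero and heavy
tailed. This is a usual choice in most sparse coding methods.
Therefore, we have: $P(\mathbf{a}) \propto \exp(-\lambda\|\mathbf{a}\|_1)$, where $\lambda$ controls
the sparsity of $\mathbf{a}$.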

Unlike in previous dictionary learning methods, we impose
a hyperprior on the covariance matrix of the noise in our
model. Because we assume the noise at each location is independent,
our hyperprior is factorial, i.e., $P(\Sigma) = \prod_{i=1}^{N} P(\sigma_i)$.

We do not know what shape this noise hyperprior should have,
so we choose the non-informative Jeffreys prior for the noise
variance at each depth sample $i$. The Jeffreys prior on the
variance of a normal distribution is simply $1/\sigma_i^2$, which gives:

$$P(\Sigma) = \prod_{i=1}^{N} P(\sigma_i) = \prod_{i=1}^{N} \frac{1}{\sigma_i^2}. \tag{4}$$

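With the defined priors, our optimization problem in Eq. (2),
cast as an energy minimization problem, becomes:

$$\Phi^* = \arg\min_{\Phi} E(\mathbf{f}|\Phi) \approx \arg\min_{\Phi} \min_{\mathbf{a},\Sigma} \left[\sum_{i=1}^{N} \left(\log\sigma_i^2 + \frac{(f_i - \hat{f}_i)^2}{2\sigma_i^2}\right) + \lambda\|\mathbf{a}\|_1\right], \tag{5}$$

where $E(\mathbf{f}|\Phi) = -\log P(\mathbf{f}|\Phi)$. The following section
explains how to minimize this energy function.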
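
To make the structure of the energy concrete, here is a minimal sketch that evaluates an objective with a per-sample log-variance term, a variance-weighted squared residual and an $\ell_1$ penalty. The function name `ns_sc_energy` and its argument layout are hypothetical, and the exact constants in front of each term are an assumption of this sketch.

```python
import numpy as np

def ns_sc_energy(f, Phi, a, var, lam=1.0):
    """Energy sum_i [log var_i + r_i^2 / (2 var_i)] + lam * ||a||_1.

    f   : (N,) observed patch
    Phi : (N, K) dictionary
    a   : (K,) coefficients
    var : (N,) total noise variance at each sample
    """
    r = f - Phi @ a                      # residual f_i - f_hat_i
    data = np.sum(np.log(var) + r ** 2 / (2.0 * var))
    return float(data + lam * np.sum(np.abs(a)))
```

With this form, a large residual costs less when the variance at that sample is large, but the log term keeps the variances from growing without bound.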

III. Inference and learning in the sparse coding model with
non-stationary noise

Both inference and learning are accomplished by minimiz- 
ing the negative log probability of the data under the model. 
This is done via an alternating optimization technique that 
can be viewed as a variational approximation to the E-M 
algorithm. At the beginning of each iteration, we select a depth 
map patch and the pixel-wise noise variances are initialized to 
fairly large values. The inference consists of two parts. The 
energy is first minimized with respect to the coefficients. Next, 
with these coefficients held fixed, we minimize the energy with 
respect to the pixel-wise noise variances. These two steps are
alternated until convergence, which usually happens in only a
few iterations. Finally, in the learning step, we compute the
gradient of the energy function with respect to the dictionary,
using the inferred coefficients and noise variances, and take
a small step in the direction of the negative gradient (learning).
The details of this scheme are described below.
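
The alternating inference described above can be sketched as follows. This is only an illustration under stated assumptions: the coefficient step uses plain ISTA (the solver is not specified in the text), and the variance step uses the minimizer of an energy of the form $\log v + r^2/(2v)$ per sample, clamped from below at $\sigma_0^2$, which may differ from the closed form used by the authors. The names `soft_threshold` and `infer` are hypothetical.

```python
import numpy as np

def soft_threshold(x, t):
    """Proximal operator of t * ||x||_1."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def infer(f, Phi, lam=1.0, sigma0=0.01, init_var=1.0, n_outer=5, n_ista=200):
    """Alternate between sparse coefficients and per-sample variances."""
    N, K = Phi.shape
    a = np.zeros(K)
    var = np.full(N, init_var)           # start from fairly large variances
    for _ in range(n_outer):
        # Step 1: variance-weighted l2-l1 problem in a, solved with ISTA
        w = 1.0 / var
        L = np.linalg.norm(Phi.T @ (w[:, None] * Phi), 2)  # Lipschitz constant
        for _ in range(n_ista):
            r = f - Phi @ a
            a = soft_threshold(a + (Phi.T @ (w * r)) / L, lam / L)
        # Step 2: with a fixed, minimize log v + r^2 / (2 v) subject to v >= sigma0^2
        r = f - Phi @ a
        var = np.maximum(r ** 2 / 2.0, sigma0 ** 2)
    return a, var
```

Both steps can only lower the energy they share, so the loop is a descent scheme; samples that the dictionary cannot explain end up with large inferred variances.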

A. Inference 

Our inference step differs from the usual convex optimiza- 
tion done in sparse coding since we need to infer the variances 
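of the noise at each depth sample. In Section II we have seen
that the total noise has two components: 1) the approximation
noise $\epsilon$ with variance $\sigma_0^2$ that is equal for all depth samples,
and 2) the external noise $\eta$ with variance $\sigma_{\eta,i}^2$ that differs at
each sample. In the inference step, we will assume that $\sigma_0$ is
fixed and we optimize with respect to $\sigma_{\eta,i}$. Note that this will
not significantly influence the obtained results since we can
put a small value for $\sigma_0$ and all the noise variability will be
shifted to $\sigma_{\eta,i}$. However, it will ensure that $\sigma_i \neq 0$ (with
$\sigma_i^2 = \sigma_0^2 + \sigma_{\eta,i}^2$) and that the solution is stable. The inference step is then:

$$(\mathbf{a}, \{\sigma_{\eta,i}\})^* = \arg\min_{\mathbf{a}, \{\sigma_{\eta,i}\}} \left[\sum_{i=1}^{N} \left(\log(\sigma_0^2 + \sigma_{\eta,i}^2) + \frac{\big(f_i - \sum_{j=1}^{K} a_j\phi_j(i)\big)^2}{2(\sigma_0^2 + \sigma_{\eta,i}^2)}\right) + \lambda\|\mathbf{a}\|_1\right], \tag{6}$$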

where $\phi_j$, $j = 1, \dots, K$ are atoms from $\mathcal{D}$. We perform this
optimization in two alternating steps. First, we fix a large value
for all $\sigma_i$'s and optimize with respect to $\mathbf{a}$. This case is the
regular $\ell_2$-$\ell_1$ optimization, where the $\sigma_i$'s may all be different
constants. In the second step, we fix $\mathbf{a}$ and use a closed form
solution for the hyperparameters $\sigma_i$, which is a function of the
residuals $f_i - \hat{f}_i$, where $\hat{f}_i = \sum_j a_j\phi_j(i)$. Each step is guaranteed to
descend the energy function. In practice, convergence is usually
achieved in the first few iterations.

B. Learning

Learning is accomplished by taking a small step in the
direction of the negative gradient of the energy function with
respect to the dictionary $\Phi$:

$$\Delta\Phi \propto -\frac{\partial E}{\partial \Phi} = \left\langle \Sigma^{-1}(\mathbf{f} - \Phi\mathbf{a}^*)\mathbf{a}^{*T} \right\rangle. \tag{7}$$

Since we select different depth map patches at each iteration,
this effectively averages learning gradients drawn from the
entire data set, denoted by $\langle\cdot\rangle$. The learning rule differs from
that of standard sparse coding in that the dictionary update 
is weighted by the inverse of the noise covariance matrix. 
This results in adaptive dictionary updates, where observations 
with smaller noise variance (more reliable ones) have a higher 
influence on the learning than the observations with high noise 
variance (unreliable ones). 
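
A minimal sketch of such a variance-weighted dictionary update over a batch of patches is given below. The function name `dictionary_step`, the learning rate and the optional renormalization of atoms to unit norm are assumptions of this example; renormalization is a common practice in sparse coding but is not stated in the text.

```python
import numpy as np

def dictionary_step(Phi, F, A, V, lr=0.01, normalize=True):
    """One gradient step on the dictionary, weighted by inverse noise variance.

    Phi : (N, K) dictionary
    F   : (B, N) batch of patches
    A   : (B, K) inferred coefficients for each patch
    V   : (B, N) inferred total variances (np.inf masks a sample)
    """
    # Rows hold Sigma^{-1} (f - Phi a) for each patch; inf variance gives 0
    R = (F - A @ Phi.T) / V
    # Batch average of Sigma^{-1} (f - Phi a) a^T
    grad = R.T @ A / F.shape[0]
    Phi_new = Phi + lr * grad
    if normalize:
        # Assumption: keep atoms at unit norm to avoid scale drift
        norms = np.linalg.norm(Phi_new, axis=0, keepdims=True)
        Phi_new = Phi_new / np.maximum(norms, 1e-12)
    return Phi_new
```

Setting a variance to infinity gives that sample zero weight, which is one way to keep missing pixels from influencing the update.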

In the rest of the paper, for convenience we will refer to
our sparse coding method with non-stationary noise as ns-SC.
It is important to note that ns-SC can be used not only for 
learning from depth maps, but also in the following general 
cases: 



1) when we learn from data acquired by sensors that
introduce non-stationary noise;

2) when we learn from inferred data, where each inferred 
variable has a certain reliability (e.g., learning in layered 
models). 

The second case is certainly a very important one, as there are 
many examples encountered in nature where we need to learn 
or infer the states of some hidden variables from other inferred 
variables. A relevant example is depth inference from stereo 
using sparse priors, which we propose in the next section. 

IV. Stereo matching with sparse priors 

The dense depth estimation problem is usually formulated 
as a MAP estimation problem. Given the left and right images
$L$ and $R$ respectively, we want to estimate a disparity map
$\mathbf{f}$. In other words, to each pixel $i$ in one of the images
(the one that we choose as a reference) we need to assign
a certain disparity value $f_i$. We propose an approach that
combines the Markov Random Field (MRF) formulation of 
depth inference, commonly used in computer vision, and 
inference using higher-order sparse priors. We briefly review 
the MRF approach and then describe the proposed method for 
depth inference using sparse priors. 

A. MRF approach to depth inference 

Most state-of-the-art depth estimation approaches in computer
vision formulate depth inference by the following optimization
problem:

$$\mathbf{f}^* = \arg\max_{\mathbf{f}} P(\mathbf{f}|L, R) = \arg\max_{\mathbf{f}} P(L, R|\mathbf{f})P(\mathbf{f}), \tag{8}$$

where $P(L, R|\mathbf{f})$ is the data likelihood, and $P(\mathbf{f})$ is the joint
prior for disparity variables $f_i$. The likelihood term is usually
modeled with a factorial Gaussian distribution:



$$P(L, R|\mathbf{f}) \propto \prod_{i=1}^{N} \exp\left[-\frac{(L_i - R_{i+f_i})^2}{2\rho^2}\right] = \prod_{i=1}^{N} \exp\left[-\frac{D(f_i)}{2\rho^2}\right], \tag{9}$$

where $L_i$ is the value of the left image at pixel $i$, $R_{i+f_i}$ is the
value of the right image at pixel $j$ displaced from pixel $i$ by $f_i$,
and $\rho^2$ is the stationary noise variance. In computer vision, the
disparity $f_i$ is usually one-dimensional since the stereo images
are rectified. However, in general one can also consider
two-dimensional disparities. The function $D(f_i) = (L_i - R_{i+f_i})^2$
is usually called the data consistency term.
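
For rectified grayscale images, the data consistency term can be tabulated for every pixel and every candidate integer disparity, as in the sketch below. The function name `data_cost_volume` is hypothetical, and the sign convention (a left pixel at column `x` is matched to the right pixel at column `x - d`) is an assumption of this example; other setups use the opposite sign.

```python
import numpy as np

def data_cost_volume(left, right, max_disp):
    """cost[d, y, x] = (left[y, x] - right[y, x - d])**2 for d = 0..max_disp.

    Entries whose match would fall outside the right image stay at +inf.
    """
    left = np.asarray(left, dtype=float)
    right = np.asarray(right, dtype=float)
    H, W = left.shape
    cost = np.full((max_disp + 1, H, W), np.inf)
    for d in range(max_disp + 1):
        diff = left[:, d:] - right[:, :W - d]
        cost[d, :, d:] = diff ** 2
    return cost
```

Taking the per-pixel minimum over `d` gives the winner-takes-all disparity, which ignores any prior and is usually noisy in textureless or specular regions.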

¹ Since there is a unique mapping from disparity values to depth values,
inferring disparity is an equivalent problem to inferring depth.

When the depth map is modeled by a Markov Random Field
(MRF), the prior over disparities can be expressed as:

$$P(\mathbf{f}) \propto \exp\left[-\sum_{c\in C} V_c(\mathbf{f})\right] = \exp\left[-\sum_{i} V_1(f_i) - \sum_{i,j} V_2(f_i, f_j) - \dots\right], \tag{10}$$

where $C$ is a set of cliques, and the $V_c$ are clique potentials [16]
of first, second and higher order. A particularly interesting case
is when the cliques are at most of order two, such that the
prior includes pairwise correlations between disparity variables
at neighboring nodes. In this case, the disparity estimation
problem is:



$$\mathbf{f}^* = \arg\min_{\mathbf{f}} E(\mathbf{f}|L, R) = \arg\min_{\mathbf{f}} \left[\sum_{i} \frac{D(f_i)}{2\rho^2} + \sum_{i} V_1(f_i) + \sum_{i}\sum_{j\in\mathcal{N}_i} V_2(f_i, f_j)\right], \tag{11}$$

where $V_1$ and $V_2$ are first and second order clique potentials, which can
be defined in a number of different ways depending on the task
at hand. $\mathcal{N}_i$ denotes the neighborhood nodes of node $i$. Such
energies can be efficiently minimized by graph cut [1], belief
propagation [17], log-cut [18], etc. When the first order potentials
$V_1$ are equal (no preferred disparity), the energy function in
Eq. (11) reduces to the one used in most computer vision
algorithms. Second order potentials incorporate the smoothness
constraint, and they can be evaluated as the absolute distance
between disparities of neighboring nodes, or by the Potts energy,
which assigns a constant penalty to neighboring nodes whose disparities differ.
However, a pairwise model such as (11) cannot incorporate
higher order structure such as depth edges. Although the MRF
model can be extended to include triplewise correlations [2],
including even higher order priors in a single layer leads to
high complexity graphs, which in general cannot be optimized
by the graph cut [19].
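
As an illustration of a pairwise energy of this kind, the sketch below evaluates the data term plus a smoothness term on a 4-connected grid, using either the absolute difference or the Potts penalty. The function name `mrf_energy` and the `weight` parameter are hypothetical; each neighboring pair is counted once here, so any factor of two from a double sum over neighborhoods is assumed to be absorbed into `weight`, and first order potentials are taken to be equal and therefore omitted.

```python
import numpy as np

def mrf_energy(disp, cost, rho2=1.0, weight=1.0, potts=False):
    """Pairwise MRF energy of an integer disparity map.

    disp : (H, W) integer disparity labels
    cost : (num_labels, H, W) data consistency term D for each label
    """
    H, W = disp.shape
    yy, xx = np.mgrid[0:H, 0:W]
    data = cost[disp, yy, xx].sum() / (2.0 * rho2)
    dx = disp[:, 1:] - disp[:, :-1]      # horizontal neighbor differences
    dy = disp[1:, :] - disp[:-1, :]      # vertical neighbor differences
    if potts:
        # Potts: constant penalty whenever neighboring labels differ
        smooth = np.count_nonzero(dx) + np.count_nonzero(dy)
    else:
        # Absolute distance between neighboring disparities
        smooth = np.abs(dx).sum() + np.abs(dy).sum()
    return float(data + weight * smooth)
```

The Potts variant does not grow with the size of a disparity jump, which is why it is often preferred when sharp depth discontinuities are expected.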

Another approach to modeling higher order dependencies is 
via a sparse prior over a dictionary adapted to the structure of 
signals. Such an approach has proven successful in natural 
images, where a sparsity prior on a dictionary of oriented 
edges solves inverse problems such as denoising and inpainting
[6], [13], [20]. We expect that such priors would also
play a crucial role in solving the correspondence problem,
and thus we approach the problem of depth inference by
including a sparsity prior on local depth features learned by the
algorithm proposed in Section II. The details of our solution
are described in the next section.

B. Depth inference using sparse priors 

We propose to combine the MRF structure with a sparse
coding network within a two-layer graphical model shown
in Fig. 1. The bottom layer is made up of the left and right
input images. The middle layer is modeled as an MRF (as
previously described), where each node consists of two latent
variables: depth estimates $f_i$ and the reliability of each depth
estimate given by $\sigma_i$. The top layer consists of latent sparse
coefficients $a_j$ that capture higher order depth dependencies.
The depth inference problem is cast as:

$$\mathbf{f}^* = \arg\max_{\mathbf{f}} P(\mathbf{f}|L, R, \Sigma, \mathbf{a}) = \arg\max_{\mathbf{f}} P(L, R|\mathbf{f}, \mathbf{a})P(\mathbf{f}|\Sigma, \mathbf{a})P(\Sigma)P(\mathbf{a}), \tag{12}$$

which has two additional variables with respect to Eq. (8):
the covariance matrix $\Sigma$ of depth noise (i.e., the variance
or reliability of each depth estimate $\sigma_i$), and the sparse
coefficients $\mathbf{a}$ that represent hidden units. We propose to solve
this problem by alternating inference in each of the layers
separately, i.e., the algorithm iterates between the estimation
of $\mathbf{f}$ in the middle MRF layer and the inference of $\mathbf{a}$ and $\Sigma$
by the sparse priors in the top layer.

Fig. 1. Two-layer graphical model for depth inference.

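In inferring the middle layer, $\mathbf{a}$ and $\Sigma$ are fixed and $\mathbf{f}$ is
inferred as:

$$\mathbf{f}^* = \arg\min_{\mathbf{f}} E(\mathbf{f}|L, R, \mathbf{a}) = \arg\min_{\mathbf{f}} \sum_{i} \left[\frac{D(f_i)}{2[\rho(\sigma_i)]^2} + \frac{(f_i - \hat{f}_i)^2}{2\sigma_i^2} + \sum_{j\in\mathcal{N}_i} V_2(f_i, f_j)\right], \tag{13}$$

where $\hat{f}_i$ is an element of the vector $\hat{\mathbf{f}} = \Phi\mathbf{a}$. Since $\mathbf{a}$ and $\Sigma$
are constant in this layer, priors $P(\mathbf{a})$ and $P(\Sigma)$ vanish from
the energy function. This problem is similar to the inference
problem in Eq. (11) with two differences: a) the data term
variance $\rho^2$ depends on the variance of depth estimates $\sigma_i^2$ and
is different for each $f_i$; and b) the first order clique potentials depend on $\mathbf{a}$
and $\sigma_i$ and are given as $V_1(f_i) = (f_i - \hat{f}_i)^2/(2\sigma_i^2)$. The data
term variances $[\rho(\sigma_i)]^2$ can be estimated by their expected
values around $\hat{f}_i$ under the noise variance $\sigma_i^2$. We use a square
window $(\hat{f}_i - \sigma_i, \hat{f}_i + \sigma_i)$ to calculate this expectation, i.e.:

$$[\rho(\sigma_i)]^2 = \left\langle \big(D(f) - D(\hat{f}_i)\big)^2 \right\rangle, \quad \forall f \in (\hat{f}_i - \sigma_i, \hat{f}_i + \sigma_i). \tag{14}$$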

Once we have inferred $\mathbf{f}$ in the middle layer, the inference
in the upper layer becomes:

$$(\mathbf{a}, \{\sigma_i\})^* = \arg\min_{(\mathbf{a},\{\sigma_i\})} E(\mathbf{a}, \Sigma|\mathbf{f}) = \arg\min_{(\mathbf{a},\{\sigma_i\})} \left[\sum_{i} \left(\log\sigma_i^2 + \frac{(f_i - \hat{f}_i)^2}{2\sigma_i^2}\right) + \lambda\|\mathbf{a}\|_1\right], \tag{15}$$

where the hidden units are inferred by the ns-SC from the
disparity estimates obtained by the middle layer. Since ns-SC
evaluates both the mean $\hat{f}_i$ and the variance $\sigma_i^2$, it sends
feedback to the middle layer to update the states of these
variables. With new estimates for each variable $\hat{f}_i$ and $\sigma_i^2$,
the middle layer can re-evaluate the new disparity estimates
with new clique potentials.

The main role of describing each node in the MRF by a
Gaussian with mean $\hat{f}_i$ and variance $\sigma_i^2$ is to resolve ambiguities
in stereo matching. Namely, when the data likelihood
at a certain point is unreliable ($\rho(\sigma_i)$ is large), the stereo
matching algorithm puts more weight on the prior given by
$V_1(f_i)$, which is estimated by the sparse priors from the upper
layer. This happens, for example, at scene points with specular
illumination, where the surface is not Lambertian. Since data 
consistency is violated at those points, we need to use prior 
information to solve the correspondence problem. 
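
The trade-off between the data term and the prior fed back from the upper layer can be illustrated with a per-pixel unary cost of the form $D(f_i)/(2\rho_i^2) + (f_i - \hat{f}_i)^2/(2\sigma_i^2)$. The function name `unary_cost` and the array layout are hypothetical, and integer disparity labels are assumed for simplicity.

```python
import numpy as np

def unary_cost(D, rho2, f_hat, var):
    """Per-pixel cost of each disparity label: data term plus sparse-prior term.

    D     : (num_labels, H, W) data consistency term for each integer label
    rho2  : (H, W) data term variance at each pixel
    f_hat : (H, W) disparity predicted by the sparse code (Phi a)
    var   : (H, W) variance of that prediction
    """
    labels = np.arange(D.shape[0], dtype=float)[:, None, None]
    data = D / (2.0 * rho2[None])
    prior = (labels - f_hat[None]) ** 2 / (2.0 * var[None])
    return data + prior
```

When `rho2` is small the data term dominates and the best label follows the image evidence; when `rho2` is large the minimum moves toward `f_hat`, which is the behavior described above for unreliable points.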

V. Experimental results 
A. Learning of depth dictionaries 

We have learned overcomplete dictionaries of depth atoms
using the regular sparse coding (SC) [6] and ns-SC. For training,
we have used the ground truth disparity maps from the
2005 and 2006 Middlebury stereo datasets [14]. Two examples
of disparity maps are shown in Fig. 2. These maps represent
"inverse" depth, i.e., the disparity is inversely proportional to 
depth, but keeps the same features (e.g. edges) as depth. It also 
represents "projective depth", because these disparity values 
are dependent on the viewing angle. Finding representations 
for the projective depth is especially desirable in multi-view 
(3DTV) technologies, where an image is usually aligned to a 
depth map in order to simplify view synthesis. On the other 
hand, laser range scanners are typically of different resolution 
and sampling than images, which makes them hard to register. 
Another possibility would be to learn on depth maps from 
time of flight cameras (TOF). However, there are no publicly 
available databases for TOF data. Therefore, we have chosen 
the Middlebury database for learning. No prior whitening has 
been performed. 

Depth maps from the Middlebury stereo dataset were ob- 
tained using the structured light technique and they have 
missing pixels (black pixels in Fig. 2). We treat those pixels
in two ways: 

1) Approach 1: set the noise variance at missing pixels 
to infinity (their contribution to learning is thus zero), 
while the variance values of the rest of the pixels are 
inferred during ns-SC learning; and 

2) Approach 2: treat those pixels in the same manner as 
other pixels and perform ns-SC learning (i.e., variances 
of all pixels are inferred). 

The dictionary learned with Approach 1 is denoted as $\Phi_1$
and with Approach 2 as $\Phi_2$. The idea behind learning with
Approach 2 is to see how well the ns-SC algorithm would
do if it had no information where the missing pixels are, and
would have to infer that. For comparison, we have also learned
the dictionary using standard sparse coding with constant noise
variance within the map [6], except at the missing pixels where
the variance is infinity (i.e., missing pixels do not bias the
learning). This dictionary is denoted as $\Phi_3$.



Within each iteration of ns-SC (and SC as well), we have
randomly chosen a large set of 16 × 16 patches. The dictionary
size has been set equal to the signal size (256), in order to
limit the complexity. Note that overcomplete dictionaries can
be learned as well, leading to better performance in target
applications and a higher computational cost. The parameter
$\sigma_0$ has been chosen to have a small value $\approx 0.01$. Since
the sparsity parameter $\lambda$ is subsumed by the variance of the
noise inferred at each pixel, we set it to 1. Fig. 3 displays
learned dictionaries $\Phi_1$, $\Phi_2$ and $\Phi_3$. We can see that $\Phi_1$
and $\Phi_2$ are qualitatively similar, with mostly edge-like depth
functions and some slant-like atoms. These are the types
of features usually seen in depth maps. The dictionary $\Phi_3$
learned with SC exhibits some repetitive atoms, meaning that
learning prefers some directions in the high-dimensional space. 
This might be explained by the fact that learning is done 
on unwhitened data, so some directions have higher energy 
(variance). This does not happen in ns-SC since the variance is 
inferred and the learning rule is adapted accordingly. However, 
it is hard to say which dictionary is the best based only on 
the qualitative assessment. Therefore, in the next section we 
perform quantitative comparison of these dictionaries on depth 
map denoising. 

B. Depth map denoising 

Depth maps obtained by laser range scanners and TOF 
cameras are typically corrupted by spatially varying noise. 
Denoising of these depth maps thus becomes an important step 
in applications that involve view synthesis in hybrid (depth 
plus video) camera systems [21]. Depth denoising can be
achieved with the inference step of the ns-SC algorithm. For
evaluation purposes we have added synthetic noise to 1% of
randomly chosen pixels in a depth map block of 100 × 100
pixels, taken from the Purves natural depth maps database [22]
(different from the depth maps used for learning). The noise
has been generated according to the non-stationary Gaussian
model. Each pixel is corrupted by Gaussian noise whose
variance is randomly chosen from 0 to 1. We divide a depth
map into overlapping patches of 16 × 16 pixels (patches are
shifted by 1) and denoise each patch as: $\mathbf{y} = \Phi\mathbf{a}$, where
$\mathbf{a}$ is the sparse coefficient vector inferred by ns-SC. Prior to
inference, each patch has been normalized to have a variance
1, which facilitates the choice of the initialization of noise
variances (initialized also to 1). Each pixel in the denoised
depth map is evaluated as an average over patches that overlap 
at that pixel. 
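
The patch-based procedure above can be sketched as follows, with the per-patch estimator passed in as a callable so that the example stays self-contained. The function name `denoise_by_patches` and the `denoise_patch` argument are hypothetical; removing the patch mean before scaling to unit variance is an assumption of this sketch, since the text only states that patches are normalized to variance 1.

```python
import numpy as np

def denoise_by_patches(depth, denoise_patch, size=16, step=1):
    """Denoise overlapping patches and average the estimates at each pixel.

    denoise_patch : callable mapping a normalized (size, size) patch to its
                    estimate, e.g. reshaping Phi @ a from a sparse inference.
    """
    depth = np.asarray(depth, dtype=float)
    H, W = depth.shape
    acc = np.zeros((H, W))               # sum of patch estimates per pixel
    cnt = np.zeros((H, W))               # number of patches covering each pixel
    for y in range(0, H - size + 1, step):
        for x in range(0, W - size + 1, step):
            patch = depth[y:y + size, x:x + size]
            mean, std = patch.mean(), patch.std()
            std = std if std > 0 else 1.0
            # Normalize to unit variance, denoise, then undo the normalization
            est = denoise_patch((patch - mean) / std) * std + mean
            acc[y:y + size, x:x + size] += est
            cnt[y:y + size, x:x + size] += 1
    return acc / np.maximum(cnt, 1)
```

Averaging over all patches that cover a pixel reduces blocking artifacts, at the cost of running the inference once per patch position.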

The original and noisy depth maps are shown in Fig. 4a and
Fig. 4b, respectively, while Fig. 4c shows the added noise.
The denoised depth map using ns-SC with $\Phi_1$ (Fig. 4e)
is of higher quality compared to the denoised depth map
using ns-SC and $\Phi_2$ (Fig. 4f) and ns-SC and $\Phi_3$ (Fig. 4g),
where the quality is measured by the Peak-SNR (PSNR).
Besides denoising, ns-SC also estimates the variance of the
noise, shown in Fig. 4d, whose spatial distribution (location
of noisy pixels) corresponds to the noise pattern. Moreover,
ns-SC using any one of the dictionaries outperforms denoising
using classical SC with fixed variance $\ell_2$-$\ell_1$ minimization
and $\Phi_3$, shown in Fig. 4h. We have also performed median
filtering denoising (Fig. 4i), since it is a proper filter for this
type of noise; the Total Variation denoising [23] using the
algorithm of Chambolle [24] (see Fig. 4j); the Non-Local (NL)
means denoising using median filtering [25] (Fig. 4k), and
the ns-SC inference using the translation invariant wavelet 7-9
frame (TIWF, Fig. 4l). Again, ns-SC with any of the learned
dictionaries $\Phi_1$-$\Phi_3$ outperforms the other solutions, both in
PSNR and visual quality. Although the solutions obtained
by median filtering and NL-means might also look visually
pleasing, these types of filters average over the fine details
in the map, which is undesirable. Using TIWF instead of the
learned dictionaries for ns-SC (Fig. 4l) is also suboptimal since
the wavelet frame is not adapted to the statistics of the signal.
We do not report the comparisons with denoising methods
designed for stationary noise (e.g., KSVD [13], GSM [26] and
BM3D [27]), since they are not really adapted to this type of
noise and thus cannot appropriately handle it.

Fig. 2. Examples of disparity maps from the Middlebury 2006 dataset: (a) Moebius; (b) Reindeer.

Fig. 3. Learned depth dictionaries: a) with ns-SC and masked missing pixels (Approach 1); b) with ns-SC and no masking of missing pixels (Approach 2);
c) with SC and masked missing pixels.

Fig. 5a and Fig. 5b show PSNR versus the percentage of corrupted
pixels averaged over five depth maps, and they confirm the
superiority of ns-SC with $\Phi_1$ over ns-SC with other dictionaries
(Fig. 5a) and over other denoising solutions (Fig. 5b). As
expected, ns-SC using a dictionary learned without masking
the missing pixels ($\Phi_2$, Approach 2) performs worse than ns-SC
using a dictionary learned with masking ($\Phi_1$, Approach 1),
since less information is provided to the learning algorithm.²
Interestingly, ns-SC with $\Phi_2$ performs better than SC with
$\Phi_1$, which shows the superiority of ns-SC compared to SC.
Another advantage of ns-SC compared to other denoising
methods is that we do not need to choose a special value
for the regularization parameter $\lambda$ since the noise variance at
each pixel is inferred during denoising. On the other hand, for
TV denoising and NL means we have chosen the values of $\lambda$
or values of the noise variance that yield the best results.

If we have a depth map that has natural noise due to the 
acquisition process, we can use ns-SC to remove the noise, but 
also to point out the noisy pixels. These would correspond to 
the pixels whose inferred noise variance is high. Fig. [6^ shows 
a noisy laser range scan depth map from the Purves database, 
where the noisy pixels take values within the depth range (i.e., 
they are not marked as missing pixels). Reconstructions of 
this depth map obtained by NL means filtering and ns-SC 
(with $1) are shown in Fig. [6}) and Fig. [6]:, respectively. Both 
methods denoised most of the erroneous pixels, but the NL- 
means filtering also introduced a loss of texture information, 
while ns-SC preserves this information. Moreover, ns-SC gives 
an estimate of the noise variance at each pixel, displayed in 
Fig. [6ti. Interestingly, the indication of noisy pixels can be 



Note that the denoising algorithm does not use pixel masking, the masking 
refers only to the way that the dictionary is learned. 




(a) original 



(b) noisy 28.5dB 



(c) noise magnitude (d) inferred variance 




(e) ns-SC(^i), 37.8 dB (f) ns-SC($ 2 ), 36.3 dB (g) ns-SC (* 3 ), 35.3 dB 



(h) SC, 32.3dB 




(i) median, 31.3 dB 



(j) TV, 31.8 dB 



(k) NL means, 32.9dB 



(1) TIWF, 33.7dB 



Fig. 4. Denoising results for a depth map from the Purves database with synthetic noise (1% of corrupted pixels). Performance of ns-SC using different 
dictionaries 3>i, 3>2 and ^3 and performance of other denoising methods: SC- sparse coding, TV- total variation, Median, Non-Local (NL) means and 
Translation Invariant Wavelet 7-9 Frame (TIWF). The variance inferred using the ns-SC method with is shown on subfigure (d). 
Fig. 5. Average denoising performance with synthetic noise: PSNR vs. the % of corrupted pixels (averaged over five depth maps from the Purves dataset).
a) Denoising performance comparison of ns-SC using different dictionaries Φ1, Φ2 and Φ3. b) Denoising performance comparison of ns-SC (using Φ1)
with other denoising methods: SC - sparse coding with Φ3; TV - total variation with optimal λ; Median filtering; NL means - non-local means; TIWF -
non-stationary sparse inference with the translation invariant wavelet 7-9 frame.






used as a mask for inpainting by setting those pixel variances
to infinity (due to the finite precision of ns-SC, the inferred
variances are finite). The final inpainted depth map with ns-SC
is given in Fig. 6e.
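Setting a pixel's variance to infinity removes it from a variance-weighted data term, because its weight 1/variance becomes zero. The sketch below illustrates only this masking effect with a generic weighted ISTA solver. It is not the ns-SC algorithm itself: the function name `weighted_ista`, the l1 penalty `lam`, and the iteration count are assumptions made for illustration, and the observed values are assumed to be finite.

```python
import numpy as np

def weighted_ista(f, Phi, var, lam=0.05, n_iter=200):
    """Minimize 0.5 * sum_i w_i * (f_i - (Phi a)_i)^2 + lam * ||a||_1,
    with per-pixel weights w_i = 1 / var_i (hypothetical helper, not ns-SC)."""
    # An infinite variance gives a zero weight, i.e. the pixel is masked out.
    w = np.where(np.isinf(var), 0.0, 1.0 / var)
    # Lipschitz constant of the gradient of the weighted data term.
    L = np.linalg.norm(Phi.T @ (w[:, None] * Phi), 2)
    a = np.zeros(Phi.shape[1])
    for _ in range(n_iter):
        grad = Phi.T @ (w * (Phi @ a - f))
        z = a - grad / L
        # Soft thresholding enforces the sparsity prior on the coefficients.
        a = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)
    return a
```

The inpainted patch is then `Phi @ a`: masked pixels are filled in purely from the dictionary atoms selected by the remaining, reliable pixels.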

Similar results are obtained on the depth maps captured
by the PMD Time-Of-Flight (TOF) camera.³ These cameras
capture 2D video + depth dynamic information with a reasonable
time resolution. The camera sends a modulated optical
signal to the environment and measures the time of the round
trip travel of light for each pixel. The result is a depth map,
which is usually very noisy. The original and denoised depth
maps are given in Fig. 7a-d. Compared to NL-means, ns-SC
preserves some fine detail while removing the noise.

C. Stereo matching 

Finally, we show the results of the two-layer stereo matching
algorithm described in Section IV using two different
algorithms for solving the MRF in the middle layer: 1) the
graph cut (GC) algorithm [1], and 2) the second order prior
(2OP) algorithm [2]. Although these are not the top performing
algorithms on the Middlebury benchmark, they are among the
most widely known, and their code is available online. Since
our two-layer model does not require a specific optimization
algorithm for the MRF (e.g., graph cut), an interested reader
can apply an algorithm of choice. Here, we are primarily
interested in evaluating the performance improvement obtained
by adding a second layer of hidden units above the MRF
solved with the two mentioned exemplary algorithms.

We have used a dictionary learned with Approach 1
(Sec. V-A), with 32 x 32 pixel atoms, and dictionary size of
1024 atoms. The learned dictionary has atoms similar to 
but larger, which leads to increased inference efficiency for 
larger depth maps. 

Fig. 8 shows the performance of GC and GC + ns-SC (our
two-layer model with GC in the middle layer) on the Tsukuba
stereo pair from the 2001 Middlebury stereo database, which
does not belong to the training set. The original left image
from the stereo pair is shown in Fig. 8a (the right image
is similar), while the ground truth disparity map is given in
Fig. 8b. The estimated disparity map using the alpha-beta swap
graph cut algorithm [1], [19], [28] is shown in Fig. 8c. It
uses the data term with equal variances $p^2$ for single cliques,
and the Potts energy model for pairwise cliques, given as:
$V_2(f_i, f_j) = u_{\{i,j\}} \, T(f_i \neq f_j)$, where $u_{\{i,j\}} = U(|L_i - R_j|)$
and $T$ denotes the indicator function. The function $U$ is defined
as:

$$U(|L_i - R_j|) = \begin{cases} 2K & \text{if } |L_i - R_j| < 5; \\ K & \text{otherwise.} \end{cases}$$



The parameter K has been set to 20, which is in the range
proposed in [1]. Fig. 8d shows the estimated disparity map
using our two-layer model. This map was obtained at convergence
after three iterations between the disparity inference
in two layers. Inference in the bottom layer is done using
the graph cut with the modified data term and single cliques
according to Eq. (15). Since ns-SC assumes uncorrelated noise,
it infers the noise variance in single pixels of the disparity
map. Therefore, the pairwise cliques are not affected. Even
though graph cut returns discrete estimates of the disparity, 
this does not change the continuous nature of the optimization 
in the upper layer, as the quantization error is subsumed in 
(fi — fi) 2 . We can see that the two layer model improves the 
graph cut result by correcting 0.37% of pixels. The obtained 
disparity map is also visually improved. Fig. 8e shows the
map of pixels modified by adding the upper layer of sparse 
nodes: white pixels denote the correctly modified pixels (from 
erroneous to correct) and black denote falsely corrected pixels 
(from correct to erroneous). Clearly, there are more correctly 
modified pixels, which are mostly located around depth edges. 
This is consistent with the fact that the learned dictionary 
contains oriented edges. 
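As a minimal sketch, the Potts pairwise term defined above can be written as follows. The function names are hypothetical, the default K = 20 and the threshold 5 follow the values quoted in the text, and the argument `intensity_diff` stands for the difference $L_i - R_j$ that appears in the definition of $U$.

```python
import numpy as np

def potts_weight(intensity_diff, K=20.0, threshold=5.0):
    """U(|L_i - R_j|): 2K below the intensity threshold, K otherwise."""
    return np.where(np.abs(intensity_diff) < threshold, 2.0 * K, K)

def potts_pairwise(f_i, f_j, intensity_diff, K=20.0, threshold=5.0):
    """V_2(f_i, f_j) = u_{i,j} * T(f_i != f_j); zero when the labels agree."""
    differ = np.asarray(f_i) != np.asarray(f_j)
    return potts_weight(intensity_diff, K, threshold) * differ
```

A label change across similar intensities costs 2K, while a change across a strong intensity difference costs only K, which encourages disparity discontinuities to align with intensity edges.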

To demonstrate the generality of the two-layer method, we
have also tested it using the 2OP algorithm of Woodford et
al. [2], which solves the MRF with triplewise cliques. The
code and the default parameters have been obtained from
the authors' website.⁴ After only two iterations between the
two layers in our model, the percentage of bad pixels has
been reduced with respect to the 2OP algorithm by 1.26 percentage points
on the Teddy dataset (Fig. 9), and by 0.75 percentage points on the Cones
dataset (Fig. 10), where the disparity accuracy is set to one
pixel. The error corresponding to missing pixels (black pixels
on the ground truth map) has not been taken into account.
Both datasets are from the Middlebury database and do not
belong to the training set. Since the 2OP algorithm uses
random initialization, we have run the inference five times, and
obtained an average improvement of 1.25 percentage points for the Teddy set
and 0.92 percentage points for the Cones set. In each run, the two-layer model
consistently outperformed the 2OP algorithm. Interestingly,
the upper layer correctly modified pixels that are mostly
located on the surfaces and some around edges, but also falsely



3 http://www.pmdtec.com/ 



modified some depth edges (see Fig. 9e and Fig. 10e). False
modification of some depth edges is due to the fact that the
image segmentation strategy of the 2OP algorithm leads to
better estimation of depth around boundaries, compared to
just using the depth priors given by the upper layer. This
underlines the importance of learning the joint statistics of
depth and intensity, which represents a promising direction of
future research.
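The evaluation described above (percentage of bad pixels at a one-pixel disparity accuracy, ignoring missing ground truth pixels, plus the map of correctly and falsely modified pixels) can be sketched as follows. This is an illustrative reading of the text, not the authors' evaluation code: the function names are hypothetical, missing pixels are assumed to be encoded as 0 in the ground truth, and a pixel is assumed to be bad when its absolute error is strictly greater than the tolerance.

```python
import numpy as np

def bad_pixel_percentage(disp, gt, tol=1.0, missing=0):
    """Percentage of valid pixels whose disparity error exceeds tol."""
    valid = gt != missing                      # skip missing ground truth
    bad = np.abs(disp.astype(float) - gt) > tol
    return 100.0 * np.count_nonzero(bad & valid) / np.count_nonzero(valid)

def modification_map(before, after, gt, tol=1.0, missing=0):
    """+1: erroneous pixel made correct, -1: correct pixel made erroneous."""
    valid = gt != missing
    bad_before = np.abs(before.astype(float) - gt) > tol
    bad_after = np.abs(after.astype(float) - gt) > tol
    out = np.zeros(gt.shape, dtype=int)
    out[valid & bad_before & ~bad_after] = 1   # shown as white in the figures
    out[valid & ~bad_before & bad_after] = -1  # shown as black in the figures
    return out
```

The improvement reported for a stereo pair is then the difference between `bad_pixel_percentage` evaluated on the MRF-only result and on the two-layer result.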

VI. Related work 

A. Depth map representation and denoising 

Although learning representations of images has been 
widely addressed in the literature in the last few decades (6j- 
p2| , there has not been much work on learning representations 
of depth/disparity maps. The most closely related work to 
the one presented in this paper is the work by Mahmoudi 
and Sapiro, who learn sparse representations of depth maps 
for surface reconstruction [29]. Their work differs from ours
in two aspects: 1) they assume a stationary Gaussian noise
model; and 2) they learn overcomplete dictionaries per shape,
i.e., the dictionaries do not generalize to a set of depth maps.

4 http://www.robots.ox.ac.uk/~ojw/2op/index.html
(a) original (b) NL-means (c) ns-SC (d) variance from ns-SC (e) inpainted ns-SC



Fig. 6. Denoising results for a depth map from the Purves database with natural noise. Noisy samples appear mostly around depth edges, but there is also
some correlated noise (big white region). NL-means filtering (b) and ns-SC (c) both remove only uncorrelated noise, but the NL-means filter also undesirably
smooths the textured regions. In addition to denoising, ns-SC infers the variance of the noise (d), which points out the noisy samples (non-zero pixels).
The indication of noisy samples can be used for inpainting (e).




(a) original (b) denoised ns-SC (c) variance from ns-SC (d) denoised with NL-means



Fig. 7. Denoising results for a depth map from the time-of-flight camera with measurement noise. NL-means filtering (d) denoises the depth map,
but smooths out the fine details. In addition to good denoising, ns-SC (b) infers the variance of the noise (c), which points out the noisy samples (non-zero
pixels).




(a) left image (b) ground truth (c) GC (9.58%) (d) GC + ns-SC (9.21%) (e) modified pixels



Fig. 8. Disparity estimation results on the Tsukuba dataset: a) left image (288 x 384); b) ground truth disparity map; c) disparity estimation result with
graph cut [1], percentage of bad pixels: 9.58%; d) disparity estimation result with the two-layer (GC+ns-SC) model with the learned dictionary (patch size
32 x 32), percentage of bad pixels: 9.21%; e) modified pixels: white - correctly modified by the upper layer, black - falsely modified by the upper layer.




(a) left image (b) ground truth (c) 2OP (9.40%) (d) 2OP+ns-SC (8.14%) (e) modified pixels



Fig. 9. Disparity estimation results on the Teddy dataset: a) left image (375 x 450); b) ground truth disparity map; c) disparity estimation result with the
2OP algorithm [2], percentage of bad pixels: 9.40%; d) disparity estimation result with the two-layer (2OP+ns-SC) model with the learned dictionary (patch
size 32 x 32), percentage of bad pixels: 8.14%; e) modified pixels: white - correctly modified by the upper layer, black - falsely modified by the upper layer.
(a) left image (b) ground truth (c) 2OP (12.73%) (d) 2OP+ns-SC (11.98%) (e) modified pixels



Fig. 10. Disparity estimation results on the Cones dataset: a) left image (375 x 450); b) ground truth disparity map; c) disparity estimation result with the
2OP algorithm [2], percentage of bad pixels: 12.73%; d) disparity estimation result with the two-layer (2OP+ns-SC) model with the learned dictionary (patch
size 32 x 32), percentage of bad pixels: 11.98%; e) modified pixels: white - correctly modified by the upper layer, black - falsely modified by the upper layer.



As we have seen in Section V, depth dictionary learning
with stationary additive noise yields inferior denoising performance
compared to learning with a non-stationary noise
model. Besides the advantage of learning under adaptation to
noise/unreliability of depth maps, our method also infers a
measure of unreliability of each pixel in a depth map, which
is extremely useful in applications such as view synthesis [3]
and stereo matching (Section IV-B).

In the context of depth coding/compression, some researchers
have proposed constructions of transforms that are
adapted to shape and depth representation. Maitre and Do
proposed a shape-adaptive wavelet transform that generates a
small number of wavelet coefficients along depth edges [30].
The coding scheme allocates more bits for representing depth
edges, which are detected by the Canny edge detector. The construction
of piecewise smooth functions called "platelets"
is also an interesting approach for dealing with smooth
images with sharp boundaries, such as confocal microscopy
images [31] or depth maps [32]. Methods [30] and [33]
demonstrate efficient coding performance on ground truth
(not noisy) depth maps. To the best of our knowledge, these
methods have not been extended to deal with noisy, uncertain
depth maps usually obtained from stereo matching or time of
flight cameras.

In addition to the problem of dealing with unreliable depth
estimates in image based rendering, denoising of depth maps
has recently attracted significant interest due to the development
of Time-Of-Flight (TOF) cameras. Unlike in standard
imaging, the noise in depth maps is non-stationary: it has different
statistics for different scene contents. Interestingly, the
noise variance of depth pixels is inversely proportional to the
amplitude values of light captured by the sensor pixels [33].
Edeler et al. used this relation and a non-stationary noise
model equivalent to ours in order to perform superresolution
of depth maps [33]. However, their solution of the inverse
problem assumes known noise statistics particular to TOF
data, while our approach infers these statistics and is thus
more general. The aforementioned relation between noise
variances and light amplitudes does not hold for the noise
introduced around depth edges and at close distances, where it
becomes more of a "salt and pepper" nature, i.e., there are depth
outliers. Most previous work on TOF depth data denoising
deals with this noise type by first removing the outliers and
then denoising the depth map using the bilinear filter [34] or
non-local (NL) means [35] (also see [36] for the application
of NL-means to laser range data). Prior removal of outliers
is crucial here, since these would bias the estimate of the
noise variance for the depth map. Our work does not need
outlier removal, since outliers are inferred within the sparse
approximation algorithm. Moreover, we obtain a quantitative
estimate of the reliability for each pixel in the depth map. The
experimental results in Section V confirm that our approach is
superior to NL-means filtering using the median filter, which
is better suited for salt-and-pepper noise and does not smooth
out the discontinuities in the depth map. Finally, one should
note that the proposed variance inference represents a general
way of estimating the noise statistics and it can be used in
many regularization-based frameworks for denoising (e.g., in a
variational formulation of NL-means). Such a variance-adaptive
denoising strategy can be expected to improve the performance of
those methods on depth data. It thus represents an important
contribution to the field of denoising.

B. Depth from stereo 

As it is a highly ill-posed problem, stereo correspondence
depends significantly on prior information about the depth
structure in the scene. The most significant progress in stereo
matching has been made by utilizing the depth smoothness
prior. Although the idea that nearby pixels should have similar
disparities dates back to the seventies [37], high-performance
depth estimation algorithms appeared much later with
the introduction of the piecewise smoothness prior that
preserves depth discontinuities. Depth estimation algorithms,
such as the graph cut [1] and belief propagation [17], define
the matching problem as an energy minimization problem,
where the depth map is modeled as a Markov Random Field
(MRF) [16] with single and pairwise clique potentials. Due
to the significant performance improvements with respect to
previous approaches, these MRF-based algorithms became
highly successful. During the last decade, many methods based on
graph cuts and belief propagation have been proposed, which
attain performance improvements by including additional constraints
to handle occlusions [38]-[40] or by performing image
segmentation during or prior to matching [41].
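For reference, an MRF energy with single and pairwise clique potentials has the following generic form. This is the standard textbook formulation rather than an equation taken from this paper: $V_1$ (the single-clique, or data, potential) and the neighborhood set $\mathcal{N}$ are notation assumed here, while $V_2$ matches the pairwise potential used in Section V-C.

```latex
E(f) = \sum_{i} V_1(f_i) + \sum_{\{i,j\} \in \mathcal{N}} V_2(f_i, f_j)
```

Graph cut and belief propagation both seek a labeling $f$ that minimizes an energy of this type; the piecewise smoothness prior enters through the choice of $V_2$.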

Other modifications of MRF approaches include algorithms
that extend the MRF objective from single and pairwise cliques
to triplewise [2] and higher order ones [42]. Since graph-cut
algorithms cannot be straightforwardly extended to optimally
solve MRFs that include priors on these higher order dependencies
[19], methods [2] and [42] are based on the QPBO [43]
optimization algorithm. However, QPBO gives suboptimal
solutions for higher order priors, leaving a certain number of
pixels unlabeled in estimated disparity maps. Moreover, the
computational complexity of QPBO increases exponentially
with the degree of the MRF, limiting implementations to
triplewise cliques.

Since depth maps of natural scenes contain more complex
structures that cannot be captured by pairwise or triplewise
statistics, it is important to include priors on higher
order dependencies in stereo matching. The proposed two
layer approach to depth inference offers an efficient way to 
regularize the solution of stereo matching by utilizing sparse 
priors over learned depth dictionaries. Because of its generic 
formulation, the proposed method can use any of the state of 
the art MRF-based depth estimation methods in the middle 
layer, and obtain an improved depth map solution. The most 
important contribution of the proposed two-layer approach to 
stereo matching is that it can be applied so generally. 
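The generic formulation can be pictured as an alternation between two pluggable solvers. The skeleton below is only a sketch under stated assumptions: `solve_mrf` and `solve_sparse` are hypothetical callables standing in for any MRF-based stereo algorithm and for ns-SC, and the way the upper layer's reconstruction and variances enter the MRF data term (Eq. (15) in the paper) is left entirely to `solve_mrf`.

```python
import numpy as np

def two_layer_inference(solve_mrf, solve_sparse, max_iter=5):
    """Alternate between the MRF layer and the sparse layer.

    solve_mrf(recon, var) -> disparity map; recon and var are None on the
        first pass, i.e. the plain MRF algorithm is run without feedback.
    solve_sparse(disparity) -> (recon, var): sparse reconstruction of the
        disparity map and per-pixel variance estimates.
    """
    recon, var = None, None
    disparity = None
    for _ in range(max_iter):
        new_disparity = solve_mrf(recon, var)
        # Stop once the MRF layer no longer changes the disparity map.
        if disparity is not None and np.array_equal(new_disparity, disparity):
            break
        disparity = new_disparity
        recon, var = solve_sparse(disparity)   # MRF solution is held fixed
    return disparity, var
```

Swapping graph cut for another MRF solver only changes the `solve_mrf` argument; the loop itself stays the same.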

VII. Conclusions 

We have presented a method to learn dictionaries of depth 
features that capture higher-order dependencies in depth maps, 
resulting in oriented depth edges and slanted surfaces. Because 
depth is not a perfectly measurable phenomenon, learning its 
statistics has to be performed under noisy conditions, where 
the type of noise is significantly different than the one usually 
seen in images. Our new sparse coding algorithm explicitly 
takes into account the noisy nature of depth estimates, such 
that the inference and learning can "see through" the noise in 
order to fill in and learn the appropriate structure. Moreover, 
it infers a reliability measure of each sample that can be 
further used in any algorithm having inferred data as input. Our 
denoising results have demonstrated that the depth dictionary
learned with the new ns-SC method with non-stationary noise
gives superior performance compared to the state of the art.

The sparsity prior that enforces higher order dependencies 
is then exploited in a new stereo matching method. We have 
defined a two-layer graphical model where the nodes in the 
middle layer encode disparities and their correlation, and the 
nodes in the upper layer enforce sparse priors. The proposed 
approach is quite general: the inference in the middle layer 
can use any existing MRF-based depth estimation algorithm, 
which combined with sparse inference in the upper layer 
can yield improved performance. The importance of higher 
order dependencies in the depth structure is confirmed by 
the superior performance of the two layer model compared 
to the MRF-based model only, for two different MRF-based 
algorithms. A promising perspective is to use ns-SC to learn 
joint representations of texture (color) and depth. It will also 
be important to go beyond linear generative models to properly 
deal with occlusion in 3D scenes. 